Posters - Schedules


Wednesday, November 9, 2022 between 8:30 AM - 9:30 AM
Thursday, November 10, 2022 between 8:30 AM - 9:30 AM
Friday, November 11, 2022 between 8:30 AM - 9:30 AM
Virtual: A surprising loss for Recurrent Neural Networks
COSI: dream
  • Michele Tinti, The Wellcome Centre for Anti-Infectives Research School of Life Sciences University of Dundee, United Kingdom


Presentation Overview:

I developed an end-to-end procedure to predict the expression of genes from random promoters using Jupyter Notebook and TensorFlow. My approach uses recurrent (GRU) and convolutional neural networks to regress the strength of the target promoters using information encoded in the forward and reverse DNA strands. The starting point of my model, two bidirectional GRUs, was recently used in a Kaggle machine learning competition to predict the stability of RNA vaccines. In this work, I expanded on this architecture and found that adding convolutions and fully connected layers can efficiently extract features from DNA sequences and predict gene expression. Surprisingly, training this model benefits from using a binary cross-entropy loss coupled with a sigmoid activation on the output layer.
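The binary cross-entropy trick described above can be sketched as follows: continuous expression values are rescaled into (0, 1) and treated as soft targets for a sigmoid output. This is a minimal NumPy illustration, not the author's code; the scaling bounds `lo` and `hi` are assumed hyperparameters (here set to the 0–17 expression-bin range used in the challenge data).

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def bce_regression_loss(logits, expression, lo=0.0, hi=17.0):
    """Binary cross-entropy used as a regression loss.

    Expression values are rescaled into [0, 1] and used as soft
    targets for the sigmoid output of the network.
    """
    target = (expression - lo) / (hi - lo)            # map expression into [0, 1]
    pred = np.clip(sigmoid(logits), 1e-7, 1 - 1e-7)   # avoid log(0)
    return float(-np.mean(target * np.log(pred) + (1 - target) * np.log(1 - pred)))
```

Because the targets are soft (e.g. 0.5 rather than a hard 0/1 label), the loss is minimized when the sigmoid output matches the rescaled expression value, which is what makes it usable for regression.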

Virtual: DREAM Challenge 2022: Predicting gene expression using millions of random promoter sequences, by Team Wan&Barton_BBK
COSI: dream
  • Ibrahim Alsaggaf, Birkbeck, University of London, United Kingdom
  • Patrick Greaves, Birkbeck, University of London, United Kingdom
  • Carl Barton, Birkbeck, University of London, United Kingdom
  • Cen Wan, Birkbeck, University of London, United Kingdom


Presentation Overview:

In this competition, we proposed a modified Temporal Convolutional Network (TCN) and a new loss function to train our model for predicting the expression profiles of random promoter sequences. In general, a TCN is a stack of convolutional layers in which each hidden layer has the same length as the input layer, guaranteeing that the prediction for the target time point depends on the information from all previous time points. The well-known dilated convolution operation was also exploited to expand the receptive field when coping with long input sequences, and residual connections, weight normalization, and spatial dropout were used to construct the convolutional layers. In our TCN model, the nucleotides of the input promoter sequence are encoded as character-level embeddings, which are parameterized as the learnable first layer of the TCN. The final output is generated by a linear layer added on top of the last residual block, which predicts the expression profile from the learned feature representation of the entire target promoter sequence. We also proposed a new loss function for training our TCN model in this competition. It combines the Mean Squared Error (MSE) with Pearson and Spearman correlation values, where a weight applied to the MSE term reflects the relative importance of the two types of correlation.
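The composite loss might look like the following NumPy sketch (illustrative only, not the team's implementation: the `mse_weight` default is an assumption, and in actual training one would use a differentiable approximation of the rank operation so the Spearman term can be backpropagated):

```python
import numpy as np

def pearson(a, b):
    # Pearson correlation between two 1-D arrays
    a = a - a.mean()
    b = b - b.mean()
    return float((a * b).sum() / np.sqrt((a * a).sum() * (b * b).sum()))

def ranks(x):
    # rank transform (0 .. n-1); Spearman is Pearson on ranks
    r = np.empty_like(x)
    r[np.argsort(x)] = np.arange(len(x), dtype=x.dtype)
    return r

def combined_loss(pred, target, mse_weight=1.0):
    # minimize MSE while maximizing Pearson and Spearman correlations
    mse = float(np.mean((pred - target) ** 2))
    r = pearson(pred, target)
    rho = pearson(ranks(pred), ranks(target))
    return mse_weight * mse - r - rho
```

When predictions match the targets exactly, both correlations equal 1 and the MSE is 0, so the loss reaches its minimum of -2.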

We used the preprocessed training dataset (6,728,720 sequences) to train our model. As a TCN can be trained on variable-length inputs, our model was trained on 105,123 batches created sequentially from 23 bins, where each batch holds sequences of the same length. We used the weekly leaderboard testing sequences to briefly estimate the predictive performance of our model. Because of the highly noisy distribution of the training dataset, we believe that using all training sequences leads to the best model generalizability with respect to the testing sequences. Note that, because of the potential bias of the leaderboard testing sequences (they were randomly sampled as ∼13% of the entire testing dataset), we did not adopt an early-stopping strategy based on leaderboard performance, although the leaderboard sequences were used for hyperparameter optimization. According to the first-stage leaderboard results, our model successfully outperformed two benchmark methods.
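The same-length batching scheme described above can be sketched as follows (a hypothetical illustration: the grouping into one bin per distinct length and the `batch_size` parameter are assumptions, not the team's exact binning):

```python
from collections import defaultdict

def make_length_batches(sequences, batch_size):
    """Group sequences into bins by length, then cut each bin into batches,
    so every batch contains sequences of one length and no padding is needed."""
    bins = defaultdict(list)
    for seq in sequences:
        bins[len(seq)].append(seq)
    batches = []
    for length in sorted(bins):          # iterate bins sequentially
        group = bins[length]
        for i in range(0, len(group), batch_size):
            batches.append(group[i:i + batch_size])
    return batches
```

Keeping each batch at a single length avoids padding entirely, which matters when a model such as a TCN accepts variable-length inputs.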

Virtual: Predicting Gene Expression Using a Residual CNN
COSI: dream
  • Fredrik Svensson, University College London, United Kingdom
  • Maria-Anna Trapotsi, University of Cambridge, United Kingdom
  • Susanne Bornelöv, University of Cambridge, United Kingdom


Presentation Overview:

Our study describes the "Camformers" team's submission to the DREAM challenge "Predicting gene expression using millions of random promoter sequences". The objective of the challenge was to predict reporter gene expression in S. cerevisiae from a 110 nucleotide (nt) DNA sequence representing a minimal promoter. The sequences consisted of an 80 nt variable region embedded between two fixed sequences of 17 and 13 nt, respectively. In total, 6,739,258 training examples were available and were used for modelling if they fulfilled quality criteria based on the expected length and the number of undetermined bases in the sequence. We used one-hot-encoded sequences as model input. To identify the best-performing model, different architectures were created in both PyTorch and TensorFlow and explored by varying model parameters and hyperparameters using Optuna. In our hands, the best and most stable performance was obtained in PyTorch using a convolutional neural network (CNN) with residual connections. The model included six convolutional layers with three residual connections, allowing the model to bypass every other layer. Batch normalisation and dropout were used for regularisation, and a max pooling operation was added after the penultimate convolutional layer to reduce model size and improve generalisation. The output of the final convolutional layer was flattened into 13,312 features and fed into a block of two dense layers outputting 256 features, followed by a final dense layer outputting the predicted expression level. All layers except the last used a rectified linear unit activation. The whole model had 16,611,073 trainable parameters, and the total runtime from raw data to predictions was ~16 hours using one core on a Google Cloud TPU v3-8. Model performance during training was r=0.748 (r²=0.560) and ρ=0.765 on our internal validation set (10%).

Final performance on the challenge leaderboard (13% of external test data) was r=0.962 (r²=0.926), ρ=0.967, ScorePearsonR²=0.763, and ScoreSpearman=0.823, outperforming previously published methods. In the final challenge, our team "Camformers" finished in 4th place. In addition to presenting our final submission, we will also describe our insights from the design and optimisation of alternative architectures, including CNN and transformer models.
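One-hot encoding of the 110 nt promoter sequences can be sketched as below (a minimal NumPy illustration; the A/C/G/T column ordering and the all-zero handling of undetermined bases such as 'N' are assumptions, not necessarily the team's exact choices):

```python
import numpy as np

BASES = "ACGT"
BASE_INDEX = {b: i for i, b in enumerate(BASES)}

def one_hot(seq):
    """Map a DNA string to a (len(seq), 4) one-hot matrix.

    Bases outside A/C/G/T (e.g. undetermined 'N') become all-zero rows.
    """
    out = np.zeros((len(seq), 4), dtype=np.float32)
    for j, base in enumerate(seq.upper()):
        i = BASE_INDEX.get(base)
        if i is not None:
            out[j, i] = 1.0
    return out
```

For a 110 nt promoter this yields a 110x4 input matrix, which a 1-D CNN then scans along the sequence axis with the four channels as input features.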